Abstract Body



Background / Context: Accurate indicators of educational effectiveness are needed to 
advance national policy goals of raising student achievement and closing socially and culturally 
based achievement gaps. If constructed and used appropriately, such indicators for both program 
evaluation and the evaluation of teacher and school performance could have a transformative 
effect on the nature and outcomes of teaching and learning. Measures based on value-added 
models of student achievement (VAMs) are gaining increasing acceptance among policymakers 
as an improvement over conventional indicators of performance. Much controversy exists, 
however, as to the best way to construct VAMs and to their optimal application. A plethora of 
methods has been developed (e.g., Sanders & Horn, 1994; Sanders et al., 1997; McCaffrey et al., 
2004; Raudenbush, 2009), and studies that compare estimates derived from different models 
have found substantial variability across methods (McCaffrey et al., 2004). Concerns remain 
that our understanding of these models is as yet limited and that incentives built around them 
may cause more harm than good, with teachers’ unions, in particular, reluctant to allow their 
constituents to be judged on the basis of measures that are potentially biased. The few studies 
that have attempted to validate VAMs have drawn different conclusions (e.g., Kane & Staiger, 
2008; Rothstein, 2008), and questions about the validity of VAMs linger. 

Purpose / Objective / Research Question / Focus of Study: This paper is the first in a 
series of papers that aims to resolve controversies surrounding VAMs. Our research question is: 
How well do different estimators perform in estimating teacher effects in a commonly used 
VAM framework? 

We focus our study on six estimators, most of which are commonly used in the research 
literature and in policy applications involving teacher effects. We answer our research question 
by first outlining the assumptions that must be met for each estimator to have good statistical 
properties in the context of a fairly common theoretical framework. We then apply the estimators 
to the task of recovering teacher effects in simulated data. Each data set is generated to violate 
or maintain specific assumptions, in an attempt to mimic different student grouping and teacher 
assignment scenarios, and we compare the estimators’ performance across these data sets. 

Setting: We use simulated data and conduct the research at Michigan State University. 

Population / Participants / Subjects: We simulate data on students and teachers in 
grades 3, 4, and 5 in a hypothetical school district. 

Intervention / Program / Practice: There is no intervention used in our study. 

Research Design: Our empirical investigations consist of a series of Monte Carlo 
simulations to evaluate the quality of various VAM estimation approaches. We use artificially 
generated data to investigate how well different estimators recover true effects under different 
scenarios. These scenarios correspond to data generating processes (DGPs) that vary the 
mechanisms used to assign students to teachers. To data generated from each DGP, we apply the 
set of estimators discussed in Section 3. We then compare the resulting estimates with the true 
underlying effects. 

To isolate fundamental problems, we restrict the DGPs to a relatively narrow set of 
idealized conditions. We assume that test scores are perfect reflections of the sum total of a 
child’s learning and that they are on an interval scale that remains constant across grades. We 
assume that there are no time-varying child or family effects, no interactions between students 
and teachers or schools, and no peer effects, and that unobserved child-specific 
heterogeneity has a constant effect in each time period. We also assume that the geometric 
distributed lag (GDL) assumption holds, namely, that decay in schooling effects is constant over 
time. Finally, there are no time effects embedded in our DGPs. Thus we provide idealized 
conditions under which to test the performance of the estimators in uncovering teacher effects. 
If they fail under these conditions, they will do worse under the more complex conditions of the 
educational system. 

Footnote: Kane and Staiger (2008) compare experimental VAM estimates for a subset of Los Angeles 
teachers with earlier non-experimental estimates for those same teachers and find that they are 
similar. Rothstein (2008) devises falsification tests that challenge the validity of VAM-based 
measures of teacher performance in North Carolina. 

To mirror the basic structural conditions of an elementary school system for, say, grades 
3 through 5 over the course of three years, we create data sets that contain students nested within 
teachers nested within schools, with students followed longitudinally over time. Our simple 
baseline DGP is as follows: 

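$$A_{i3} = \lambda A_{i2} + \delta_{i3} + \beta_{i3} + c_i + e_{i3}$$

$$A_{i4} = \lambda A_{i3} + \delta_{i4} + \beta_{i4} + c_i + e_{i4} \qquad (13)$$

$$A_{i5} = \lambda A_{i4} + \delta_{i5} + \beta_{i5} + c_i + e_{i5}$$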

where $A_{i2}$ is a baseline score reflecting the subject-specific knowledge of children 
entering third grade, $\lambda$ is a time-constant decay parameter, $c_i$ is a time-invariant 
child-specific growth effect, $\beta_{it}$ is the teacher-specific contribution to growth (the true 
teacher value-added effect), $\delta_{it}$ is the school-specific contribution to growth (the true 
school effect), and $e_{it}$ is a random deviation for each student. (Because we assume that 
$A_{i3}$ depends on $A_{i2}$ using the same decay $\lambda$, it makes sense to think of $A_{i2}$ as a 
second-grade test score or a pre-test score.) Because we assume independence of $e_{it}$ over 
time, we are maintaining the common factor restriction in the underlying cumulative effects 
model. We assume that the time-invariant child-specific growth effect $c_i$ is uncorrelated with 
the baseline test score $A_{i2}$. 

In the simulations reported in this paper, we do not include school effects; thus the 
$\delta_{it}$ are set to zero. The random variables $A_{i2}$, $\beta_{it}$, $c_i$, and $e_{it}$ are drawn from 
normal distributions, where we adjust the standard deviations to allow different relative 
contributions to the scores. In the simulations we report, the standard deviation of the 
student-specific effects is 0.5, the standard deviation of the teacher effects is 0.25, and the 
standard deviation of the $e_{it}$ is 1.0. In all cases, the specific values of the variables were 
randomly selected, so the correlations among them were assumed to be 0. As a result, in the case 
where $\lambda = 1$, the teacher effect explains approximately 1/20th of the variance of the gain 
score. We chose these parameters as a first approximation to generating data that would be 
consistent with data observed in state testing programs that have vertically scaled tests in 
place. In those testing programs, the effect size of student growth for one academic year of 
instruction ranges from .2 to .5. Because it is believed that few students actually decline in 
achievement after a year of instruction, the share of students whose combined change in 
performance, summed over all components, is negative should be small. If the amount of 
growth on average is assumed to be .5 standard deviations, the standard deviation of student 
growth should be no larger than .5 so that 16% or less of the students will show negative growth. 
With the values for the generating model parameters given above, obtaining a standard deviation 
of student growth of .5 or less requires a value for $\lambda$ of less than 1. A value of .5 was used 
for half of the simulations. We also assume that the teacher effect is smaller than the 
student effect. We chose the value for the standard deviation of $e_{it}$ to be 1, indicating that it 
is responsible for more of the variation in test scores than the teacher and student fixed effects. 
However, we tested the sensitivity of our results to other parameterization choices, and the 
results were relatively robust. 
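
As a rough illustration of the baseline DGP, the following Python sketch simulates one cohort in one school under random grouping and random assignment, with school effects set to zero. It is a minimal sketch under stated assumptions, not the authors' simulation code: the function names, the fixed seed, the restriction to a single school and cohort, and the unit standard deviation of the baseline score (which the text does not report) are all illustrative choices. It also computes the teacher share of gain-score variance when the decay parameter equals 1 and the components are independent, 0.25² / (0.25² + 0.5² + 1²) = 1/21, which matches the "approximately 1/20th" figure.

```python
import random

SD_TEACHER, SD_STUDENT, SD_ERROR = 0.25, 0.5, 1.0  # values stated in the text


def teacher_share_of_gain_variance():
    """Teacher share of gain-score variance when lambda = 1 (independent components)."""
    total = SD_TEACHER ** 2 + SD_STUDENT ** 2 + SD_ERROR ** 2
    return SD_TEACHER ** 2 / total


def simulate_cohort(lam=1.0, n_teachers=4, class_size=20, sd_baseline=1.0, seed=0):
    """Simulate grades 3-5 for one cohort in one school.

    sd_baseline is an assumed value; the text does not report it.
    Returns (beta, rows): true teacher effects keyed by (grade, teacher)
    and one row per student-grade observation.
    """
    rng = random.Random(seed)
    n = n_teachers * class_size
    c = [rng.gauss(0.0, SD_STUDENT) for _ in range(n)]       # c_i, fixed over time
    prev = [rng.gauss(0.0, sd_baseline) for _ in range(n)]   # A_i2, baseline score
    beta, rows = {}, []
    for grade in (3, 4, 5):
        for j in range(n_teachers):
            beta[(grade, j)] = rng.gauss(0.0, SD_TEACHER)
        # Random assignment with equal class sizes: shuffle the classroom seats.
        seats = [j for j in range(n_teachers) for _ in range(class_size)]
        rng.shuffle(seats)
        current = []
        for i in range(n):
            j = seats[i]
            score = lam * prev[i] + beta[(grade, j)] + c[i] + rng.gauss(0.0, SD_ERROR)
            rows.append({"student": i, "grade": grade, "teacher": j,
                         "lag": prev[i], "score": score})
            current.append(score)
        prev = current
    return beta, rows


beta, rows = simulate_cohort(lam=1.0)
print(len(rows), round(teacher_share_of_gain_variance(), 4))  # 240 0.0476
```

Non-random grouping or assignment would replace the shuffle step with a rule based on prior scores or on the student effect; that step is deliberately left out of this sketch.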
Our data structure has the following characteristics that do not vary across simulations: 

• 10 schools 
• 3 grades (3rd, 4th, and 5th) of scores and teacher assignments, with a base score in 2nd grade 
• 4 teachers per grade 
• 20 students per classroom 
• 4 cohorts of students 
• No crossover of students to other schools 

To create different scenarios, we vary certain key features: the sorting of students and 
teachers into schools, the grouping of students into classes, the assignment of classes of students 
to teachers within schools, and the amount of decay in prior learning from one period to the next. 
Within each of the three school-sorting cases, we study the 10 different mechanisms for the 
assignment of students outlined in Table 1. (Insert Table 1 here) Finally, we vary the decay 
parameter $\lambda$ as follows: (1) $\lambda = 1$ (no decay or complete persistence) and (2) $\lambda = .5$ 
(fairly strong decay). The DGPs chosen for each simulation reproduce scenarios in which the 
assumptions mentioned above either hold or are violated. Thus, we explore 3 × 10 × 2 = 60 
different scenarios in this paper. In the future we will allow school effects; for now, our 
simulations set the school effects to zero. We use 500 Monte Carlo replications per scenario in 
evaluating each estimator. 
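
The scenario count can be checked by enumerating the design grid. This is only a bookkeeping sketch: the school-sorting labels are placeholders (the abstract names only "Case 1"), and the variable names are illustrative.

```python
from itertools import product

# Placeholder labels: the abstract names only "Case 1" for school sorting.
SCHOOL_SORTING = ["Case 1", "Case 2", "Case 3"]
# Grouping and assignment acronyms from Table 1.
MECHANISMS = ["RG-RA", "DG-RA", "DG-PA", "DG-NA", "BG-RA",
              "BG-PA", "BG-NA", "HG-RA", "HG-PA", "HG-NA"]
DECAY = [1.0, 0.5]       # values of the decay parameter lambda
REPLICATIONS = 500       # Monte Carlo replications per scenario

scenarios = list(product(SCHOOL_SORTING, MECHANISMS, DECAY))
n_teachers = 10 * 3 * 4  # schools x grades x teachers per grade

print(len(scenarios), n_teachers, len(scenarios) * REPLICATIONS)  # 60 120 30000
```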

Analysis: The following are the estimating equations we use, reflecting the 
simplifications determined by our DGPs. Specifically, we remove the time-varying intercept 
because our data have no time effects, we have no time-varying child and family effects, and we 
assume that $\pi_t = 1$, so that the child-specific effect $c_i$ enters with a constant coefficient: 

$$\Delta A_{it} = E_{it}\beta_0 + c_i + e_{it} \qquad (14)$$

$$A_{it} = \lambda A_{i,t-1} + E_{it}\beta_0 + c_i + e_{it} \qquad (15)$$

$$\Delta A_{it} = \lambda \Delta A_{i,t-1} + \Delta E_{it}\beta_0 + \Delta e_{it} \qquad (16)$$

where $E_{it}$ is the vector of teacher dummies. 

For each of the 500 iterations pertaining to one DGP, we estimate effects for each of the 
120 teachers using each of six estimation methods: pooled OLS (POLS) applied to (14), so that 
the presence of $c_i$ is effectively ignored in estimation; random effects (RE) applied to (14); 
fixed effects (FE) applied to (14); POLS applied to (15) (which we have called dynamic OLS or 
DOLS); instrumental variables applied to (16) (a simplified version of Arellano and Bond that we 
call AB in the text and that the tables label FDIV, for “first-difference instrumental variables”); 
and Blundell and Bond (BB) applied to (16). 
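
To make two of these estimators concrete, here is a minimal Python sketch of POLS on the gain-score equation (14) and of DOLS on equation (15). It is an illustration under simplifying assumptions, not the authors' implementation: it uses a full set of teacher dummies with no base teacher, a single cross-section, and hypothetical function names and toy data. With only teacher dummies on the right-hand side, the POLS coefficients are class-mean gains; the DOLS coefficients are obtained by the Frisch-Waugh-Lovell partialling-out steps.

```python
from collections import defaultdict
from statistics import fmean


def pols_gain(rows):
    """POLS on the gain score with teacher dummies only: class-mean gains."""
    gains = defaultdict(list)
    for teacher, lag, score in rows:
        gains[teacher].append(score - lag)
    return {t: fmean(g) for t, g in gains.items()}


def dols(rows):
    """OLS of the score on the lagged score and a full set of teacher dummies."""
    by_teacher = defaultdict(list)
    for teacher, lag, score in rows:
        by_teacher[teacher].append((lag, score))
    means = {t: (fmean(l for l, _ in v), fmean(s for _, s in v))
             for t, v in by_teacher.items()}
    # Slope on the lagged score from within-teacher deviations.
    sxy = sxx = 0.0
    for t, v in by_teacher.items():
        mean_lag, mean_score = means[t]
        for lag, score in v:
            sxy += (lag - mean_lag) * (score - mean_score)
            sxx += (lag - mean_lag) ** 2
    lam_hat = sxy / sxx
    # Teacher coefficients: class-mean score net of the lagged-score component.
    effects = {t: ms - lam_hat * ml for t, (ml, ms) in means.items()}
    return lam_hat, effects


# Noise-free toy data with lambda = 0.5 (hypothetical values, for checking only).
true_beta = {"t1": 0.3, "t2": -0.1, "t3": 0.2}
offsets = {"t1": 0.0, "t2": 1.0, "t3": 2.0}  # classes differ in prior achievement
rows = [(t, off + k, 0.5 * (off + k) + true_beta[t])
        for t, off in offsets.items() for k in range(4)]

lam_hat, dols_effects = dols(rows)
pols_effects = pols_gain(rows)
```

In this toy example DOLS recovers the decay parameter and the teacher effects exactly, while the gain-score estimates absorb the omitted decay term and rank teacher t3 below t2 even though its true effect is larger. The example contains no noise and no student heterogeneity, so it says nothing about sampling behavior.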

Evaluating the Estimators 

For each iteration and for each of the six estimators, we save the estimated individual 
teacher effects, which are the coefficients on the teacher dummies, and collapse these data to the 
teacher level, retaining the true teacher effects generated from the simulation as well. 
To study how well the methods uncover the true teacher effects, we adopt some simple summary 
measures. We regress the estimated 119 teacher effects (where one teacher is the base case) on 
the true effects generated from the simulation. From this simple regression, we report the average 
coefficient and its standard deviation across the 500 simulations. Regressing the estimated 
teacher effects on the true teacher effects tells us whether the estimated teacher effects are 
correct when compared with the average teacher. To see this, recall that we can write a simple 
regression equation as 

$$\hat{\beta}_j - \bar{\hat{\beta}} = \theta(\beta_j - \bar{\beta}) + \text{residual}_j \qquad (17)$$

where $\hat{\beta}_j$ is the estimated effect of teacher $j$ (obtained using a particular estimation 
approach), $\beta_j$ is the true effect of teacher $j$, the overbars represent averages (across the 119 
teachers), and $\theta$ is the simple regression coefficient. If $\theta = 1$ then a movement of $\beta_j$ 
away from its mean is tracked by the same movement of the estimate away from its mean. In effect, 
$\hat{\beta}_j - \bar{\hat{\beta}}$ is an unbiased estimator of $\beta_j - \bar{\beta}$ for all $j = 1, \ldots, 119$. 
Notice that this is different from saying $\hat{\beta}_j$ is unbiased for $\beta_j$, something we cannot 
guarantee because the means of the $\hat{\beta}_j$ and the $\beta_j$ will typically differ. (In effect, the 
mean value of the $\beta_j$ is not identified, so we can only ask how well the estimators are 
recovering the teacher effects compared with the mean.) 

If $\theta > 1$ then the estimation procedure is magnifying the differences of teachers relative 
to the mean. For example, a teacher who is slightly above average can look much better than the 
average because $\beta_j - \bar{\beta}$ is multiplied by $\theta$. Similarly, a teacher just below average 
can appear much worse than average. If $\theta < 1$ (but positive) then the estimation method 
shrinks the estimated effects toward their mean, making it more difficult than it should be to 
distinguish any teacher from the average. Because there is always sampling error in the estimates, 
we report the average of the estimates $\hat{\theta}$, which we denote $\bar{\theta}$, across the 500 
replications. 

The standard deviation of the $\hat{\theta}$ across simulations tells us how much the estimate 
deviates from its average. So, if the average value, $\bar{\theta}$, is near one and the standard 
deviation is small, then much of the time the estimated effects are properly tracking the true 
effects. A larger standard deviation means the estimates are correct on average but variable. 

As we already discussed, if $\bar{\theta}$ is very far from one, then the estimated effects are 
systematically magnifying or compressing the teacher effects (relative to the mean, as always). 
When this is coupled with a small standard deviation, we can conclude that 
the estimated teacher effects are far off in most simulations. This is an undesirable situation. 
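
The slope measure described above can be sketched in a few lines of Python. The function names and the toy numbers are hypothetical, and the use of the sample (n − 1) standard deviation across replications is an assumption, since the text does not say which version is reported.

```python
from statistics import fmean, stdev


def theta_hat(estimated, true):
    """Slope from a simple regression of estimated on true teacher effects."""
    mean_est, mean_true = fmean(estimated), fmean(true)
    num = sum((b - mean_true) * (bh - mean_est) for bh, b in zip(estimated, true))
    den = sum((b - mean_true) ** 2 for b in true)
    return num / den


def summarize(thetas):
    """Average and (sample) standard deviation of the slope across replications."""
    return fmean(thetas), stdev(thetas)


true = [-0.3, -0.1, 0.0, 0.15, 0.25]            # hypothetical true effects
magnified = [2.0 * b + 1.0 for b in true]       # differences doubled, mean shifted
compressed = [0.5 * b - 0.2 for b in true]      # differences halved, mean shifted

print(theta_hat(magnified, true), theta_hat(compressed, true))
```

The two slopes come out at about 2 and 0.5: the shift in the mean has no influence, which mirrors the point that only deviations from the average teacher are being compared.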

We also report the average Spearman rank correlation between the estimated and true 
effects across the 500 iterations. We do so because, as we mentioned, certain policy applications 
of value-added measures of teacher performance may rely solely on rankings and not on the 
magnitude of the deviation from the average. Thus the rank correlation tells us how well the 
estimators preserve the true rankings. 
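
For reference, a Spearman rank correlation can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python illustration with hypothetical function names and toy data; it assigns average ranks to ties and does not handle constant inputs.

```python
from statistics import fmean


def ranks(values):
    """Ranks 1..n; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = fmean(rx), fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


true = [-0.3, -0.1, 0.0, 0.15, 0.25]        # hypothetical true effects
magnified = [2.0 * b + 1.0 for b in true]   # magnified but order-preserving
print(spearman(magnified, true))
```

Estimates that magnify or compress the true effects while keeping their order still have a rank correlation of 1, which is why the rank correlation and the regression slope are reported side by side.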

Findings / Results: Our findings for the case in which both students and teachers are 
randomly sorted across schools and $\lambda = 1$ are shown in Tables 2A and 2B (full findings not 
reported in this abstract). All estimators work well when students are randomly grouped in 
classrooms and classrooms are randomly assigned to teachers. Our main finding is that no one 
method is guaranteed to reliably capture true teacher effects in all contexts, although some are 
more robust than others. Because we consider a variety of DGPs, student grouping mechanisms, 
and teacher assignment mechanisms, it is not surprising that no single method works well in all 
contexts. That the POLS and RE estimators work well under random assignment (regardless of 
how students are grouped), even when $\lambda$ is much less than unity, suggests that the estimators 
are fairly robust in such situations with only misspecified dynamics. Unfortunately, misspecified 
dynamics lead to much worse performance when assignment is based on past performance or on 
the unobserved student effect. Thus, these estimators cannot be trusted for certain realistic 
assignment mechanisms. Instead, estimating a simple dynamic regression by OLS was superior, 
often by a large margin, in situations of dynamic grouping of students with nonrandom 
assignment to teachers. 

The FE estimator is fairly robust to dynamic misspecification under random assignment 
and to either form of static assignment. The downside is that dynamic misspecification in 
conjunction with dynamic assignment causes FE to be very badly behaved: the rank correlation 
between the estimated and true teacher effects is actually negative and almost .5 in magnitude in 
the case with the most decay. Again, the DOLS estimator works much better. One caveat is that 
our simulations use a fairly small variance for the student effect in the gain-score equation. The 
evidence suggests DOLS will work less well as student heterogeneity becomes relatively more 
important. 

The simulations show that the Arellano and Bond approach can be an attractive 
alternative to FE, especially when FE is subject to dynamic misspecification. In fact, except in 
two cases of dynamic grouping of students (DG-RA and DG-NA), AB outperforms FE when 
$\lambda < 1$. However, it is clearly inferior to DOLS in cases of dynamic assignment. Overall, 
the performance of estimators that attempt to eliminate heterogeneity suffers in 
contexts in which teachers are non-randomly sorted across schools; these estimators are no 
longer reliable. 

Our findings suggest that choosing estimators on the basis of structural modeling 
considerations may produce inferior results. The DOLS estimator is never the prescribed 
approach under the structural cumulative effects model with a geometric distributed lag (unless 
there is no student growth heterogeneity), yet it is often the best estimator. That is not to say that 
the general cumulative effects model is incorrect. It merely reflects the fact that the assumptions 
needed to make it tractable - linearity, the geometric distributed lag, the common factor 
restriction - along with the possibility of nonrandom assignment may yield estimators that are 
poorly behaved. The findings in this paper, though special, suggest that flexible approaches 
based on dynamic treatment effects (for example, Lechner (2008), Wooldridge (2010, Chapter 
21)) may be fruitful: one can think of the DOLS estimator as a regression-based version of a 
dynamic treatment effects estimator. 

Conclusions: This study has taken the first step in evaluating different value-added 
estimation strategies in the conditions under which they are most likely to succeed. It is clear 
from this study that many VAMs hold promise: they may be capable of overcoming obstacles 
presented by non-random assignment and provide valuable information. Given their context 
dependency, however, caution must be applied in interpreting the findings that VAMs, as 
currently applied in practice and in the research literature, are producing. At this point, they may 
best be viewed as suggestive evidence of effects. In addition, methods of constructing 
estimates of teacher effects that we can trust for high-stakes evaluative purposes must be further 
studied. Although value-added measures of teacher performance hold promise, more research is 
needed before they can be used for policy purposes. There is much left to investigate. 
An important issue to study is the effect of failure of the common factor restriction on the 
various procedures. The DGP that we used would be much more flexible if we allowed the errors in 
the cumulative effects formulation to follow an unrestricted AR(1) process. As mentioned earlier, 
with more general DGPs we should add dynamic treatment effects estimators to the list of 
competitors. In addition, if contextual problems (grouping and assignment mechanisms) can be 
deduced from available data, then it may be possible to determine which estimators should be 
applied in a given context. For this purpose, structural modeling considerations may be helpful in 
that they yield tests that have the potential to identify violations of particular assumptions. In 
addition to the relaxation of assumptions and the development of contextual diagnostics, further 
research is needed regarding the ability of more detailed astructural approaches to detect 
treatment effects with greater accuracy. Investigations in these areas are the subject of current 
research in progress by the authors. 

Appendix B. Tables and Figures 

Table 1: Grouping and Assignment Acronyms 

| Acronym | Process for grouping students in classrooms | Process for assigning students to teachers |
|---|---|---|
| RG-RA | Random | Random |
| DG-RA | Dynamic (based on prior test scores) | Random |
| DG-PA | Dynamic (based on prior test scores) | Positive correlation between teacher effects and prior student scores (better teachers with better students) |
| DG-NA | Dynamic (based on prior test scores) | Negative correlation between teacher effects and prior student scores |
| BG-RA | Static based on baseline test scores | Random |
| BG-PA | Static based on baseline test scores | Positive correlation between teacher effects and baseline student scores |
| BG-NA | Static based on baseline test scores | Negative correlation between teacher effects and baseline student scores |
| HG-RA | Static based on heterogeneity | Random |
| HG-PA | Static based on heterogeneity | Positive correlation between teacher effects and student fixed effects |
| HG-NA | Static based on heterogeneity | Negative correlation between teacher effects and student fixed effects |

Table 2A: Average estimates of $\theta$ and $\lambda$ over 500 replications in the teacher-level 
regressions of estimated effects on true effects: Case 1, $\lambda = 1$ 

| Assignment mechanism | POLS | DOLS | RE | FE | FDIV | BB |
|---|---|---|---|---|---|---|
| RG-RA | 1.000 (0.047) | 1.000 (0.053), $\lambda$ = 1.104 | 1.001 (0.045) | 0.997 (0.096) | 0.959 (0.103), $\lambda$ = .906 | 1.079 (0.124), $\lambda$ = 1.185 |
| DG-RA | 1.004 (0.068) | 1.002 (0.051), $\lambda$ = 1.088 | 1.005 (0.060) | 0.895 (0.193) | 0.689 (0.216), $\lambda$ = .363 | 1.066 (0.413), $\lambda$ = 1.162 |
| DG-PA | 1.343 (0.058) | 1.000 (0.068), $\lambda$ = 1.107 | 1.266 (0.056) | -0.410 (0.135) | 0.480 (0.211), $\lambda$ = .131 | -1.561 (0.297), $\lambda$ = 1.648 |
| DG-NA | 0.638 (0.057) | 0.993 (0.068), $\lambda$ = 1.096 | 0.700 (0.055) | 2.086 (0.157) | 0.540 (0.257), $\lambda$ = .269 | 3.493 (0.338), $\lambda$ = 1.630 |
| BG-RA | 1.002 (0.047) | 1.002 (0.063), $\lambda$ = 1.188 | 1.002 (0.045) | 0.995 (0.096) | 0.955 (0.103), $\lambda$ = .897 | 1.076 (0.126), $\lambda$ = 1.169 |
| BG-PA | 1.003 (0.049) | 0.733 (0.061), $\lambda$ = 1.135 | 1.002 (0.048) | 0.988 (0.102) | 0.950 (0.111), $\lambda$ = .895 | 1.062 (0.127), $\lambda$ = 1.201 |
| BG-NA | 0.996 (0.048) | 1.233 (0.065), $\lambda$ = 1.126 | 0.997 (0.046) | 0.988 (0.110) | 0.949 (0.117), $\lambda$ = .903 | 1.082 (0.144), $\lambda$ = 1.185 |
| HG-RA | 1.003 (0.090) | 1.002 (0.087), $\lambda$ = 1.099 | 1.003 (0.075) | 0.999 (0.097) | 0.970 (0.101), $\lambda$ = .953 | 1.080 (0.126), $\lambda$ = 1.199 |
| HG-PA | 1.641 (0.068) | 1.567 (0.070), $\lambda$ = 1.080 | 1.506 (0.061) | 0.995 (0.094) | 0.977 (0.097), $\lambda$ = .853 | 1.083 (0.131), $\lambda$ = 1.177 |
| HG-NA | 0.360 (0.068) | 0.402 (0.067), $\lambda$ = 1.090 | 0.495 (0.060) | 0.996 (0.100) | 0.935 (0.116), $\lambda$ = .930 | 1.074 (0.124), $\lambda$ = 1.189 |

Table 2B: Average Spearman rank correlation over 500 replications between the estimated 
teacher effects and the true teacher effects: Case 1, $\lambda = 1$ 

| Assignment mechanism | POLS | DOLS | RE | FE | FDIV | BB |
|---|---|---|---|---|---|---|
| RG-RA | 0.883 (0.025) | 0.842 (0.031) | 0.889 (0.024) | 0.671 (0.082) | 0.638 (0.079) | 0.592 (0.081) |
| DG-RA | 0.782 (0.039) | 0.837 (0.031) | 0.816 (0.036) | 0.378 (0.085) | 0.255 (0.078) | 0.229 (0.089) |
| DG-PA | 0.896 (0.018) | 0.842 (0.033) | 0.902 (0.018) | -0.307 (0.083) | 0.188 (0.083) | -0.423 (0.066) |
| DG-NA | 0.640 (0.063) | 0.827 (0.034) | 0.709 (0.058) | 0.779 (0.048) | 0.199 (0.084) | 0.693 (0.053) |
| BG-RA | 0.884 (0.024) | 0.795 (0.037) | 0.889 (0.023) | 0.665 (0.082) | 0.633 (0.081) | 0.590 (0.082) |
| BG-PA | 0.885 (0.023) | 0.705 (0.056) | 0.890 (0.022) | 0.669 (0.081) | 0.638 (0.082) | 0.592 (0.082) |
| BG-NA | 0.883 (0.025) | 0.866 (0.024) | 0.887 (0.024) | 0.662 (0.084) | 0.631 (0.082) | 0.577 (0.083) |
| HG-RA | 0.694 (0.049) | 0.702 (0.048) | 0.755 (0.042) | 0.672 (0.082) | 0.647 (0.080) | 0.588 (0.083) |
| HG-PA | 0.914 (0.015) | 0.900 (0.017) | 0.918 (0.014) | 0.670 (0.079) | 0.658 (0.075) | 0.560 (0.080) |
| HG-NA | 0.387 (0.087) | 0.420 (0.082) | 0.556 (0.074) | 0.668 (0.078) | 0.602 (0.084) | 0.610 (0.076) |